Agreement and reliability of hepatic transient elastography in patients with chronic hepatitis C: A cross‐sectional test–retest study

Abstract
Background and Aims: Transient elastography (TE) has largely replaced liver biopsy to evaluate fibrosis stage and cirrhosis in chronic hepatitis C. Previous studies have reported excellent reliability of TE but agreement metrics have not been reported. This study aimed to assess interrater agreement and reliability of repeated TE measurements.
Methods: Two operators performed TE independently, directly after each other. The primary outcome was disagreement, defined as a difference in TE results between operators of ≥33%, as well as the smallest detectable change, SDC95 (i.e., the difference between measurements needed to state with 95% certainty that there is a difference in underlying stiffness). Secondary outcomes included reliability, measured as intraclass correlation (ICC), and patient and examination characteristics associated with the agreement.
Results: In total, 65 patients were included, with a mean liver stiffness of 9.7 kPa. Of these, 21 (32%) had a disagreement in TE results of ≥33% between the two operators. The SDC95 on the log scale was 1.97, indicating that an almost twofold increase or decrease in liver stiffness would be required to confidently represent a change in the underlying fibrosis. Reliability, estimated using the ICC, was acceptable at 0.86. In a post hoc analysis, fasting less than 5 h before TE was associated with a higher degree of disagreement (48% vs. 19%, p = 0.03).
Conclusions: In our clinical setting, interrater agreement in directly repeated TE measurements was surprisingly low. It is essential to further investigate the reliability and agreement of TE to determine its validity and usefulness.


| INTRODUCTION
Early identification of fibrosis and cirrhosis of the liver enables prevention and treatment of long-term consequences of chronic hepatitis C, including liver failure, hepatocellular cancer, and death. 1 Even where antiviral treatment is generally offered, fibrosis and cirrhosis staging affect decisions regarding surveillance for hepatocellular cancer and esophageal varices. 2 Traditionally, liver biopsy has been the reference standard for fibrosis and cirrhosis assessment, despite being associated with discomfort and complications. 3 Transient elastography (TE) is a noninvasive method for measuring liver stiffness that has replaced liver biopsy in a majority of patients with chronic hepatitis C. 4 Due to its noninvasive nature, TE is also suitable for longitudinal monitoring.
Previous studies suggest comparable accuracy between liver biopsy and TE to detect fibrosis in hepatitis C patients. 5,6 Studies have reported excellent test-retest reliability for TE. 7-11 Reliability refers to an instrument's ability to discriminate between study subjects and may be excellent even with considerable measurement error if the population is heterogeneous. 12-15 Agreement, in contrast, refers to differences in measurements on the original scale (kPa for TE). Whether the agreement is acceptable is a situation-dependent clinical consideration and can be related to the smallest detectable change (SDC). 13,16 SDC corresponds to the following clinical question: if a patient has undergone TE previously, how much higher, or lower, must a new TE result be to confidently represent a change in underlying fibrosis? This question cannot be addressed using reliability statistics alone and, to the best of our knowledge, agreement metrics or SDC have not been addressed in previous studies.
The aim of the present study was to assess agreement, SDC, and reliability of repeated measurements of liver stiffness by TE in chronic hepatitis C patients, and to explore factors associated with disagreement.

| TE procedures
Liver stiffness was measured in kPa using the Fibroscan® device (Echosens), for which the reported kPa value was the median of 10 consecutive measurements. The interquartile range divided by the median (IQR %) was noted. The TE was considered valid when 10 consecutive successful measurements had been gathered and the IQR % was lower than 30%, as in the clinical routine. The Fibroscan® was operated by a total of nine individuals: eight medical doctors and one specially trained nurse. The majority (n = 6) of the medical doctors were specialists in infectious diseases, and two were resident physicians. All operators had been certified by the company behind Fibroscan®, but none were considered experienced, as all had performed fewer than 500 exams. The TE experience of each observer is detailed in Supporting Information: Table S4.
Patients were put in the recommended body position, with the right arm on top of the head and legs positioned towards the left.
This body position was unaltered during the entire procedure. The probe position for the previous measurement was not known but typical marks generated by the probe on the patient remained.
Operators could select a medium or XL probe at their own discretion based on the patient's configuration.
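The validity rule described above (median of 10 readings, IQR % below 30%) can be sketched as follows. This is an illustrative sketch only: the function name `summarize_te` and the quartile convention are assumptions, and the device's internal quartile calculation is not stated in the text.

```python
from statistics import median, quantiles

def summarize_te(readings_kpa, max_iqr_ratio=0.30, required=10):
    """Return (median_kpa, iqr_ratio, is_valid) for one TE examination.

    readings_kpa: the individual successful stiffness readings in kPa.
    The quartile method ("inclusive") is an assumption, not the device's
    documented behaviour.
    """
    med = median(readings_kpa)
    q1, _, q3 = quantiles(readings_kpa, n=4, method="inclusive")
    iqr_ratio = (q3 - q1) / med
    # Valid only with exactly the required number of readings and a
    # sufficiently narrow spread relative to the median.
    is_valid = len(readings_kpa) == required and iqr_ratio < max_iqr_ratio
    return med, iqr_ratio, is_valid
```

For example, ten identical readings give an IQR % of zero and a valid examination, whereas readings split evenly between 5 and 15 kPa give an IQR % of 100% and an invalid one.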

| Outcomes
The primary outcome was disagreement, expressed as (1) the proportion of participants with interrater differences at or above our prespecified threshold (33%) and (2) the SDC, representing the difference needed to state with 95% certainty that a change had occurred in the underlying fibrosis (SDC95). Reliability was the secondary outcome, for continuous stiffness measurements as well as for fibrosis stage categories. 18 As an exploratory outcome, patient and exam characteristics associated with interrater differences above our prespecified threshold of ≥33% in liver stiffness were evaluated.
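The first component of the primary outcome can be illustrated with a short sketch. The text does not state which denominator defines the percentage difference; the version below assumes the mean of the two readings, and both function names are hypothetical.

```python
def relative_difference(a_kpa, b_kpa):
    """Absolute difference divided by the mean of the two readings.

    The choice of denominator is an assumption; the source only gives
    the 33% threshold.
    """
    return abs(a_kpa - b_kpa) / ((a_kpa + b_kpa) / 2)

def disagreement_proportion(pairs, threshold=0.33):
    """Share of (operator 1, operator 2) pairs at or above the threshold."""
    flags = [relative_difference(a, b) >= threshold for a, b in pairs]
    return sum(flags) / len(flags)
```

With the illustrative pairs `[(10, 10), (10, 15), (6, 6.5), (20, 12)]`, two of the four pairs exceed the threshold, so the function returns 0.5. A different denominator (for instance, the lower of the two readings) would classify borderline pairs differently.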

| Agreement
The first rating for each operator was used for interrater analysis. We estimated the 95% limits of agreement (LOA95, the range encompassing 95% of differences), as described by Bland and Altman. 16 We expected heteroscedasticity (i.e., larger differences at the higher end of the scale) and measurements were transformed using the natural logarithm. The standard error of measurement (SEM) was estimated on the log scale and SDC95 was derived from SEM. 13
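A minimal sketch of these agreement calculations is given below. It assumes the SEM is taken from the standard deviation of the paired log differences (SEM = SD of differences / √2) and uses the standard relation SDC95 = 1.96 × √2 × SEM; the exact estimation route used in the study is not stated here, and the function name `log_agreement` is hypothetical.

```python
import math
from statistics import mean, stdev

def log_agreement(pairs):
    """Bland-Altman limits of agreement and SDC95 on the log scale.

    pairs: (operator 1 kPa, operator 2 kPa) tuples, all values > 0.
    Results are back-transformed so they read as ratios between readings.
    """
    diffs = [math.log(a) - math.log(b) for a, b in pairs]
    bias = mean(diffs)
    sd = stdev(diffs)
    loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd
    sem = sd / math.sqrt(2)              # assumed: SEM from paired differences
    sdc95 = 1.96 * math.sqrt(2) * sem    # equals 1.96 * sd
    return {
        "loa_ratio": (math.exp(loa_low), math.exp(loa_high)),
        "sem_log": sem,
        "sdc95_ratio": math.exp(sdc95),
    }
```

Because the analysis is on the log scale, the back-transformed SDC95 is a multiplicative factor: a value near 2 means that a new reading must be roughly double or half the previous one before a change in underlying stiffness can be claimed with 95% certainty.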

| Reliability
A one-way random effects intraclass correlation, ICC(1,1), was used to estimate reliability in continuous kPa. 13,16 Cohen's κ was evaluated for the F score as well as separately for the dichotomous outcome of F0-3 versus F4 (cirrhosis no vs. yes). ICC and κ values are expressed with 95% confidence intervals (CIs) within brackets. Intrarater agreement and reliability were determined using the same methods.
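The two reliability statistics named above can be sketched from their textbook definitions. The sketch below computes point estimates only (no confidence intervals) and uses unweighted κ; whether a weighted κ was applied to the ordinal F score is not stated in the text, so that choice is an assumption. Function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def icc_1_1(ratings):
    """One-way random effects ICC(1,1).

    ratings: one list per subject, each holding k ratings.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean([x for row in ratings for x in row])
    row_means = [mean(row) for row in ratings]
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, row_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' category labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n ** 2
    return (observed - expected) / (1 - expected)
```

Note that the ICC depends on between-subject variance, so the same measurement error yields a higher ICC in a more heterogeneous sample; this is why reliability and agreement can diverge.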

| Factors associated with discordance in liver stiffness
The other baseline variables were explored for associations with the dichotomous disagreement outcome, using nonparametric tests throughout.
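The text does not name the specific tests used. As one possible illustration for a dichotomous characteristic cross-tabulated against the dichotomous disagreement outcome, the sketch below implements a two-sided Fisher's exact test for a 2 × 2 table. The choice of this test and the function name are assumptions, not the study's documented method.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

As a purely hypothetical example, a table in which all three exposed subjects disagree and all three unexposed subjects agree, `fisher_exact_two_sided(3, 0, 0, 3)`, gives p = 0.1.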

| Participant and liver stiffness characteristics
In total, 66 patients were asked to participate, of whom one declined and 65 were included. In 60 patients, four (2 + 2) valid measurements were obtained by the two operators, and in 5 patients, three (2 + 1) measurements were obtained (in 1 patient, one measurement was invalid, and in 4 patients, the second rating was not performed).
There were 255 examinations performed overall. For interrater analysis, 130 measurements were used (65 × 2) and for intrarater analysis, 250 measurements (125 × 2) were used. The patient and examination characteristics are displayed in Table 1 and Table 2, respectively.
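The counts above follow from the participant breakdown, as the short check below shows. It assumes that each of the five patients with three measurements contributed one within-operator pair, which is how the figure of 125 pairs is reached; variable names are illustrative.

```python
full = 60      # patients with 2 + 2 valid measurements
partial = 5    # patients with 2 + 1 valid measurements

total_exams = full * 4 + partial * 3              # all examinations
interrater_measurements = (full + partial) * 2    # first rating per operator
intrarater_pairs = full * 2 + partial * 1         # assumed: one pair per partial patient
intrarater_measurements = intrarater_pairs * 2

print(total_exams, interrater_measurements,
      intrarater_pairs, intrarater_measurements)
```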

| Agreement
The results of the first and second operators are displayed in   Bland-Altman plots confirmed heteroscedasticity and the need for log transformation, see Figure 3A,B as well as Supporting Information: Tables S1 and S2. Furthermore, one extreme outlier affected the results and was addressed in a sensitivity analysis, see

| Intrarater analysis
In the intrarater analysis, all metrics were better. The

| Factors associated with interrater disagreement
When associations with disagreement ≥33% were explored, none of the nine individual operators were under- or overrepresented (each operator with ≥5 examinations was involved in discordant ratings in between 29% and 40% of cases). The only factor associated with a disagreement ≥33% was a shorter duration of fasting before elastography, see Table 3. Due to this association, we performed a post hoc analysis, where interrater differences were plotted versus fasting time (see Supporting Information: Figure S5).

| DISCUSSION
This cross-sectional study aimed to assess agreement, SDC, and reliability of repeated measurements of liver stiffness by TE in chronic hepatitis C patients, and to explore factors associated with disagreement. The interrater agreement has not been previously reported, and in our setting 32% of patients had an interrater disagreement above our prespecified threshold. Furthermore, an almost twofold increase or decrease in kPa was required to represent a change in the underlying fibrosis with 95% certainty. In a post hoc analysis, we found that longer fasting time before TE was associated with better interrater agreement.  21 However, we found that the variability seemed to be increased up to 5 h after food intake, although in a secondary analysis. Previously, high body mass index, liver biomarkers, and high IQR % have also been associated with invalid TE measurements. 22 We did not find an association between these factors and TE interrater variability. Findings related to operator experience have been conflicting. 8,22-24 Our results did not suggest systematic differences between operators, although statistical power was insufficient for formal analysis. In two participants, the two operators used different probes (medium and XL), resulting in large kPa differences, as seen in Table 3.

FIGURE 1 Scatterplot of LS ratings by the operator, in kPa. Each dot represents a subject with the rating according to the first operator on the x axis and according to the second on the y axis. The dotted line represents perfect agreement. LS, liver stiffness.

FIGURE 2 Ratings of liver stiffness by operator 1 and operator 2, in F scores, categorized according to Castera. 17
Reliability and agreement metrics were much better in intrarater than interrater analysis, indicating that the change of operator Even so, the intrarater SDC95 was 1.40, signaling a higher variability than the IQR % for the 10 readings of each result. This may be explained by the fact that, in the intrarater situation, the probe was removed and then replaced, in contrast to the 10 repeated measurements of each LS result. The protocol did not specify that the same probe location should be used, and previous studies suggest variability due to probe location. 25
This study has several limitations. We used many operators, none of whom had extensive experience by international standards.

CONFLICTS OF INTEREST STATEMENT
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The underlying data set cannot be publicly shared, due to ethical concerns. The underlying data may be requested by individual researchers from the corresponding author upon reasonable request if approval is granted by the Swedish Ethical Review Authority. Such requests will also be evaluated by the Data Protection Officer at Skåne University Hospital.

ETHICS STATEMENT
The Regional Ethical Review Board in Lund, Sweden, approved the study (2018-688). All patients included in the study signed informed consent. All methods were carried out in accordance with the Declaration of Helsinki and the Guidelines for Reporting Reliability and Agreement Studies. 26

TRANSPARENCY STATEMENT
The lead author Oskar Ljungquist affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.